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In this work we present RESCU, a powerful MATLAB-based Kohn-Sham density functional theory 
(KS-DFT) solver. We demonstrate that RESCU can compute the electronic structure properties 
of systems comprising many thousands of atoms using modest computer resources, e.g. 16 to 256 
cores. Its computational efficiency is achieved from exploiting four routes. First, we use numerical 
atomic orbital (NAO) techniques to efficiently generate a good quality initial subspace which is 
crucially required by Chebyshev filtering methods. Second, we exploit the fact that only a subspace 
spanning the occupied Kohn-Sham states is required, and solving accurately the KS equation using 
eigensolvers can generally be avoided. Third, by judiciously analyzing and optimizing various parts 
of the procedure in RESCU, we delay the 0{N^) scaling to large N, and our tests show that RESCU 
scales consistently as 0{N^'^) from a few hundred atoms to more than 5,000 atoms when using a 
real space grid discretization. The scaling is better or comparable in a NAO basis up to the 14,000 
atoms level. Fourth, we exploit various numerical algorithms and, in particular, we introduce a 
partial Rayleigh-Ritz algorithm to achieve efficiency gains for systems comprising more than 10,000 
electrons. We demonstrate the power of RESCU in solving KS-DFT problems using many examples 
running on 16, 64 and/or 256 cores: a 5,832 Si atoms supercell; a 8,788 A1 atoms supercell; a 5,324 
Cu atoms supercell and a small DNA molecule submerged in 1,713 water molecules for a total 5,399 
atoms. The KS-DFT is entirely converged in a few hours in all cases. Our results suggest that the 
RESCU method has reached a milestone of solving thousands of atoms by KS-DFT on a modest 
computer cluster. 

PACS numbers: 31.15.E-, 71.15.-m, 02.70.Bf, 31.15.xr, 


I. INTRODUCTION 


Density functional theory (DFTjP based numerical 
programs are nowadays the standard tool for predicting 
and understanding the structural and electronic proper¬ 
ties of materials that involve many electrons. The idea of 
treating complicated many-body interactions in real ma¬ 
terials by a self-consistent mean field theory appeared in 
the early days of quantum mechanics. In 1 927 , Thomas 
and Fermi proposed a semiclassical modeP^^ in which 
electrons in an external potential are described using only 
the electronic density. Subsequent calculations are sim¬ 
plified since the complicated many-body wavefunction is 
avoided. The Thomas-Fermi model was later improved 
by Dirac who included an exchange energy functional 
and by von Weizsacker who added a gradient correction 
to the kinetic energy functionaP. Nearly four decades 
later, Hohenberg and Kohn (HK) put DFT on firm the¬ 
oretical footing by proving that the ground-state expec¬ 
tation values are functionals of the density and that the 
ground-state density can be calculated by minimizing an 
energy functional Certain assumptions of the original 
HK theorems, such as the ground-state non-degeneracy, 
were later relaxed or eliminatecP. These theories proved 
that the ground-state properties of any electronic sys¬ 
tem can in principle be calculated - if not necessarily 
understood - without using many-body wave functions. 
For practical applications, Kohn and Sham (KS) demon¬ 
strated that the problem of minimizing the total energy of 
a system with respect to the electronic density could take 
the form of a non-interacting electron probleirP. In the 


KS formulation, the kinetic energy is evaluated via single 
particle wave functions which is more accurate than using 
kinetic energy functionals that depend explicitly on the 
density. KS-DFT allows one to analyze a variety of phys¬ 
ical systems and performing a DFT calculation today is 
all but synonymous to solving the KS equatioiP. 

Various approaches for solving the KS equation h ave 
emerged such as the full potential all-electron method^^ 
and the ab initio pseudopotential methodJi^HIll, jn 
KS-DFT solvers, several bases have been used to ex¬ 
press quantum mechanical operators including real space 
Cartesian grids, finite elements, planewaves, wavelets, 
numerical atomic orbitals (NAO), Gaussian orbitals, 
muffin-tin orbitals and some others. The goal is to pre¬ 
dict structural and electronic properties of real materi¬ 
als reaching the required accuracy for the given research 
topic and KS-DFT is playing a prominent role in mate¬ 
rials physics and engineering. 

At present, a major issue of practical DFT methods 
is their limited capability of solving material problems 
involving large number of atoms using a small computer 
cluster (e.g. 16 to 256 cores). For instance, algorithms 
implem ented in state-of-the-art electronic packages such 
as VASFG^EH and Ablnili^can comfortably solve systems 
comprising of a few hundred atoms - but not many thou¬ 
sands on such small computer clusters. With the widen¬ 
ing accessibility of supercomputers and the developments 
of advanced parallel computing algorithms, heroic KS- 
DFT calculations at the level of 10,000 atoms became 
possible in recent years, but at the expense of using thou¬ 
sands or even tens of thousands of computing corefl^®^. 
Nevertheless, for practical material research and innova- 
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tion, many small research groups in the world do not 
have access, cannot afford or simply wish not to use su¬ 
percomputers. An urgent and very important task is to 
develop a KS-DFT method that can solve the KS equa¬ 
tion without degrading the solution accuracy, at the level 
of several thousand atoms or more on a small computer 
cluster. It is the purpose of this work to report and de¬ 
scribe such a KS-DFT solver and its associated software 
implementation. 

To see why it is still possible to gain computational 
efficiency in traditional eigenvalue-based KS-DF T ap- 
proaches, we note - as others had noted before us^SEfil 
- that the solution process of the KS-DFT is a self- 
consistent procedure where one numerically converges 
the Hamiltonian step by step by solving the KS equa¬ 
tion repeatedly and accurately. However, it appears un¬ 
clear why one has to solve accurately the KS equation for 
the not-yet-converged Hamiltonian in the intermediary 
steps. Another observation is that, in the eigensolver- 
based KS-DFT methods, different parts in the compu¬ 
tation scale differently as a function of electron number 
N, some 0{N), others 0{N‘^) and eventually these are 
dominated by the 0{N^) parts. If one is able to “delay” 
the crossover to 0{N^) scaling, larger systems can poten¬ 
tially be solved using small computers. It turns out that 
these computational gains can be realized as we present 
below. 

Our KS-DFT method combines NAO and the real 
space finite-differences plus Cheb yshev filtering (CF) 
technique introduced by Zhou et We found it is 

key to generate efficiently a proper initial subspace in the 
Chebyshev filtering framework, and this is achieved by 
the use a NAO basis. We advance efficient parallelization, 
a partial Rayleigh-Ritz (pRR) method for the computa¬ 
tion of the density matrix and careful optimization of the 
solution process, and we have reached our goal of solving 
solid state physics problems consisting of thousands of 
atoms using 16 to 256 cores. Our code is called RESCU - 
which stands for Real space Electronic Structure Calcu¬ 
lator - and it is implemented in the technical computing 
platform MATLAB. We use our own MPI and ScaLA- 
PACK interfaces to harness efficiently the computational 
power of the cores. As such, RESCU combines the vo¬ 
cations of a prototyping code and a production code. In 
particular, the pRR allows us to compute the single par¬ 
ticle density matrix in problems involving an exceedingly 
large number of electrons by taking advantage of the 
quasi-minimal property of basis sets built from CF. In 
the present paper, we will present the algorithmic and im¬ 
plementation advancements achieved during the develop¬ 
ment of RESCU. As practical examples, we demonstrate 
the following KS-DET calculations: we simulate 5,832 Si 
atoms (23,328 electrons) on a real space grid, converging 
the entire KS-DFT calculation using 256 cores for about 
5.5 hours; we simulate 4,000 A1 atoms (12,000 electrons) 
on a real space grid, converging the entire KS-DFT cal¬ 
culation using 64 cores for about 5.1 hours; we simulate a 
supercell consisting of 13,824 Si atoms (55,296 electrons) 


using a NAO basis, converging the entire calculation us¬ 
ing 64 cores for about 6.4 hours; we simulate a supercell 
consisting of 5,324 Cu atoms (58,564 electrons) using a 
NAO basis, converging the entire calculation using 256 
cores for about 12 hours. We also consider a disordered 
system consisting of a small DNA molecule submerged in 
1,713 water molecules, for a total of 5,399 atoms (14,596 
electrons), and converge the entire KS-DFT run in 9.6 
hours on 256 cores. These results are compiled in ta¬ 
ble |H] which is found in section |VH[ The scaling of the 
RESCU method is presented going from 16 cores to 256 
cores for various tests. Finally, since RESCU is primar¬ 
ily a real space implementation of KS-DET, it does not 
require periodicity when dealing with condensed phase 
materials and can thus easily treat problems involving 
interfaces, surfaces, defects, disordered materials, etc. 

The paper is organized as follow. In section [TTj we 
briefly state the fundamentals of DET and introduce 
the single particle density matrix theoretical framework 
which is used throughout this article. In section [TTTl we 
review the state-of-the-art numerical methods for solving 
the KS equations and recount their advantages and dis¬ 
advantages. In section [Tv] we describe in some detail the 
Chebyshev filtering method. In section jV] we present a 
computational complexity analysis of the Chebyshev fil¬ 
tering method and introduce the partial Rayleigh-Ritz 
algorithm. We explain how it takes advantage of the 
Chebyshev filtered basis sets to improve on the standard 
Rayleigh-Ritz algorithm. In section lYll we describe the 
implementation of the Kohn-Sham DET solver RESCU. 
In section |VH[ we present different benchmarks of the 
RESCU code. We provide evidence when the partial 
Rayleigh-Ritz algorithm achieves significant gains over 
the standard Rayleigh-Ritz algorithm. We show how to 
generate a good quality initial subspace efficiently. Ei- 
nally, we report simulations including thousands of atoms 
with modest computer resources. Bottlenecks and future 
direction will be discussion in section IVHII and jlXj 


II. A BRIEF DISCUSSION OF DFT 

Before delving into the details of the RESCU method, 
we briefly discuss KS-DFT in general terms. As men¬ 
tioned in the introduction, the founding result of DFT is 
that the Hamiltonian of a system is uniquely determined 
by its ground-state electronic densitj®^. It follows that 
the ground-state wave function and the associated ex¬ 
pectation values are also determined by the ground-state 
density, and there exists a universal energy functional 
of the den sity which is minimized by the ground-state 
densitjffJ^. 

In the KS-DFT, the problem of minimizing the en¬ 
ergy with respect to the density is mapped to a non¬ 
interacting electron problerrP. The KS-DFT formulation 
made it possible to develop reasonably accurate energy 
functionals and it became the most successful and widely 
applied flavor of DFT. In particular, accurate kinetic en- 
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ergy functionals use the Kohn-Sham orbitals and not the 


density per se. The Kohn-Sham equation is 
ten as the the following set of equations 

usually writ- 

AiV^i — ^ext + Vh + Uxc^ ipi (1) 

00 

P(i’) = '^nFD{Xi,p)^l)*(y)'il)i{v) 

(2) 

= 47rp 

(3) 

^ ^ 5Exc \_p\ 

^ XC — c 

dp 

(4) 


The density p is the sum of the squared norm of the 
Kohn-Sham wave functions weighted by the Fermi-Dirac 
distribution. At zero temperature, the only populated 
states are the N lowest lying states where N is the num¬ 
ber of electrons in the system. The Hartree potential 
Vh may be obtained by solving the Poisson equation, 
the exchange-correlation potential Vxc is defined as the 
functional derivative of the exchange-correlation energy 
functional E^c with respect to the density, npo denotes 
the Fermi-Dirac distribution and the chemical potential 
p, is set such that the number of electrons is N. The 
equations are written in atomic units, we denote the 
Hamiltonian H = — -I- V^xt EVh + Vxc\ its dimen¬ 

sion, which corresponds to the number of real space grid 
points or the number of k-space grid points, is M and 
the eigenvalues are indexed from smallest to largest as 
follows Ai < A 2 < ...Am -1 < Am- Unlike the Schrbdinger 
equation, the KS equation is non-linear since the poten¬ 
tial depends on the density which in turn depends on 
the KS eigenstates. Consequently, the KS equation must 
be solved by cycling through Eqs.Q to 0 until a fixed 
point p* is found although other convergence criteria may 
be used. 

We introduce a more flexible framework in which 
Eqs. 0 to 0 are expressed in terms of the single particle 
density matrix defined as follows 


where ■i/'i satisfies Eq.Q. The density matrix is then 
expressed as follows 

P = (7) 

where A.y = XiSij. The Kohn-Sham eigenstates are ex¬ 
pressed in terms of basis functions {(j)i} as done in the 
following equation: 


i 

which translates as follow in matrix notation: 


'4' = $C . 


(9) 


Inserting ([^ in Q we obtain 

P = $CnFD(A,/i)C^^^ • (10) 

The matrix C satisfies a generalized eigenvalue equation, 
HC = SCA (11) 


where H = and S = is the overlap ma¬ 

trix. Since the overlap matrix is symmetric positive def¬ 
inite, it is possible to calculate its Cholesky decomposi¬ 
tion S = U U. Eq.( 11) is usually solved by reducing the 
generalized eigenvalue problem to a standard eigenvalue 
problem 


HC = CA 


( 12 ) 




where H = U HU and C = UC. Plugging Eq.([^ 
in (10) we obtain: 


P = $U ^nFZ>(H,^)U 




(13) 


It appears that as long as is a subspace of the den¬ 
sity matrix is unchanged and so is the electronic density. 
Note that the chemical potential p satisfies 


00 

= '^nFD{Xi, p)'ilj*{r)i;^{r') . (5) 

i=l 

Note that the density is simply the diagonal of the sin¬ 
gle particle density matrix. The Fermi-Dirac distribu¬ 
tion nFo{\,p) decays exponentially fast for X > p. It is 
thus reasonable to set the occupation number to zero if 
npoiXi, p) < e where e is some tolerance. Consequently, 
the number of required Kohn-Sham states L is equal to 
or slightly larger than the number of electrons N and it 
is much smaller than the linear dimension of the Hamil¬ 
tonian matrix M. In other words, the single particle 
density matrix is a low rank matrix. It is convenient to 
rewrite Eq. 0 using matrix notation. To that end, we 
define the populated Kohn-Sham eigenspace as 


( 6 ) 


N = Tr 


ufoCH-.p) 


(14) 


We shall refer to the quantity P, defined in Eq. (15) 
below, as the projected density matrix even though P 

$tp$. 


P = U \FDiil,p)U ^ 


(15) 


III. PRACTICAL ALGORITHMS FOR SOLVING 
THE KS EQUATION 

The purpose of this section is to review briefly practi¬ 
cal algorithms used in state-of-the-art and recent DET 
solvers, with the focus on elucidating where and how 
one may achieve speed-ups so that larger systems can 
be solved by KS-DFT using a modest computer. 


' 5 ' = 
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Using the definitions introduced in section [T^ we now 
reformulate the KS equation as described in the generic 
Kohn-Sham solver Algorithm below. Firstly, the elec¬ 
tronic density is initialized and the dual Hamiltonian 
generated. The density is often initialized using the iso¬ 
lated atom densities but other choices, a uniform density 
for example, are viable in certain systems. A subspace 
which spans approximately the occupied Kohn-Sham 
subspace '5' is generated. Then the Hamiltonian and 
identity operators are projected onto the subspace 
The projected density matrix is then evaluated, typically 
by solving the matrix equation © or ( |I^ . Next, the 
density is obtained from the diagonal of the real space 
single particle density matrix and the Hamiltonian is up¬ 
dated by solving the Poisson equation and evaluating 
the exchange-correlation potential. This step is gener¬ 
ally preceded by a mixing of the density or followed by a 
mixing of the potential. Finally, the convergence of the 
density and possibly other quantities are monitored. 


Algorithm 1 Generic Kohn-Sham solver 
procedure GenericSolver((5) 

Initialize po, H[po] 

while e > 5 OT k < kmax do 

Compute a subspace which spans 

— k 

Compute the projected Hamiltonian H and the 
overlap matrix S 

— k 

Compute the projected density matrix P 
Compute the density p*(r) = P*^(r,r) 

Compute H = H[pfc] 

Calculate e = \\pk - Pfe-i||, e = llH[pfc] - H[pfe_i]|| 

return pk 


Many currently used DFT codes fit in the framework 
established in Algorithm [l] They generally differ in 

how to calculate the subspace and how to compute 

_ ^ 

the projected density matrix P : these are the foci of 
the most recent algorithmic advancements and probably 
those to come as we shall mention later. We now dis¬ 
cuss how particular DFT methods translate in the above 
picture. 

The procedure executed by state-of-the-art DFT 
solvers is summarized in Algorithm below. The eigen¬ 
vectors of the Kohn-Sham Hamiltonian are calculated di¬ 
rectly which makes the rest of the procedure simple. The 

Kohn-Sham states diagonalize the Hamiltonian such that 
_^ 

H is diagonal and the overlap matrix is the identity I by 
virtue of the orthogonality of the eigenvectors. Calculat- 

ing the Fermi-Dirac operator npjjCH. , fj,) = nFD{A , p.) 
is then trivial. At zero temperature, N Kohn-Sham 
states are required if there are N electrons in the system 
(A/2 if there is spin degeneracy). When using thermal 
smearing, more states are required since the states whose 
energy is close to p are fractionally occupied. As men¬ 
tioned above, the density matrix has a low rank since 
p) decays exponentially fast for X > p. It is thus 
generally sufficient to compute L = N + N^uf KS states 


where is a modest number. These L KS states 

can be thought of as forming a KS basis that diago¬ 
nalize the KS Hamiltonian, and this Kohn-Sham basis 
is deemed quasi-minimal since L ~ A. The number of 
Kohn-Sham basis functions is typically smaller than the 
dimensionality of the discretized Hamiltonian by a few 
orders of magnitude and therefore it is advantageous to 
use partial diagonalization methods such as the Arnold! 
algorithm or the Lanczos algorithm to compute the re¬ 
quired eigenvectors. These algorithms are implemented 
in established software libraries such as ARPACKpS and 
TRLAfJ^ which are used by many DFT solvers. 


Algorithm 2 Diagonalization Kohn-Sham solver 
procedure DiagSolver(5) 

Initialize po, H[po] 

while e > 5 OT k < kmax do 

Compute the occupied Kohn-Sham subspace 
The projected Hamiltonian is and the overlap 
matrix I 

The projected density matrix is npr){A’°, p) 
Compute the density p^'^^{r) — P*^(r, r) 

Compute H = H[pfc] 

Calculate e = \\pk - Pfc-i||, e = llH[pfc] - H[pfc_i]|| 
return pk+i 


Even sparse diagonalization techniques are very com¬ 
putationally demanding if a lot of occupied states must 
be computed. As already mentioned at the end of the 
last section, the Kohn-Sham states are actually not nec¬ 
essary and a subspace which approximately spans the oc¬ 
cupied Kohn-Sham subspace may solve the KS e quatio n. 
For example, basis sets such as atomic orbit ala^^l^SlEIl 
and Gaussian orbital^^ have been used extensively in 
the DFT community. The procedure using a predefined 
basis set is summarized in Algorithm 

The main disadvantage of such methods is the diffi¬ 
culty to systematically augment the basis to improve the 
simulation accuracy or validate convergence. Many re¬ 
search groups have put forward methods to generate ba¬ 
sis sets that can achieve a system atic convergence for 
most elements from H to Ri ji4 | 23 | 2 7iBni Well established 
DFT codes using atom-centered basis functions such as 
SIESTAISI, FHI-AIM^ or GaussiaiP^ provide tested ba¬ 
sis sets and have been used extensively by researchers to 
study physical systems with a variety of chemical en¬ 
vironments. Nevertheless, some systems (e.g. certain 
metals, dense structures, solids with large coordination 
numbers) may require special treatment where new basis 
functions must be generated and tested. This is in con¬ 
trast with the procedure described in Algorithm where 
the Kohn-Sham states accuracy is only determined by the 
underlying numerical grid, which makes the convergence 
with respect to the basis set relatively more transparent 
and straightforward. 

On the other hand, predefined basis sets have many 
computational advantages. In the scheme of Algorithm 

the subspace needs not be updated at every step and 
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hence it is generated ahead of the self-consistent loop. 
Other quantities such as the projected kinetic energy ma¬ 
trix, the projected non-local ionic potential matrix and 
the overlap matrix are also computed ahead of the itera¬ 
tive process. Only the diagonal part of the Hamiltonian 
corresponding to the Hartree and exchange-correlation 
potentials has to be updated and projected onto the sub¬ 
space $. Among other advantages, predefined basis sets 
are often localized by design and the projection of diag¬ 
onal operators scales linearly with respect to system size 
and has a relatively cheap computational cost. The basis 
functions may also have spherical symmetry which makes 
it possible to perform certain integrals analytically. The 
main bottleneck is the computation of the projected den¬ 
sity matrix which is usually obtained by diagonalizing the 

reduced projected Hamiltonian H . Whereas these ba¬ 
sis sets are not minimal, their dimension is generally a 
modest multiple of the number of electrons. For systems 
with less than a thousand atoms or so, these methods are 
competitive as the matrix H — AS is directly invertible 
or diagonalizable. In larger systems, it becomes crucial 
that the projected Hamiltonian matrix be made as small 
as possible, and these methods become less competitive. 


Algorithm 3 Orbital Kohn-Sham Solver 
procedure OrbitalSolver((5) 

Initialize po, H[po] 
while e > S or k < kmax do 
The subspace $ is constant 

Compute the projected Hamiltonian H ; the overlap 
matrix S is constant 

Compute the projected density matrix P 
Compute the density p*"'"^(r) = P*^(r,r) 

Compute H = H[pfc] 

Calculate e = \\pk - Pfe-i||, e = l|H[pfe] - H[pfc_i]|j 
return pk+i 


A way around this issue is to use polynomial approx¬ 
imations of the Fermi-Dirac operator or other opera¬ 
tors simulating its effect. Goedecker and Colombo have 
used polynomial approximations of the Fermi operator!^ 
Goedecker and Teter have used Chebyshev polynomial 
approximations of the complementary error functiorP^to 
simulate the action of the Fermi operator; Jay et al. used 
Chebyshev-Jackson polynomial expansions of the Heav¬ 
iside functioiP^ etc. These techniques are free of diag- 
onalization but have bottlenecks and limitations of their 
own. Polynomial approximations work only at finite tem¬ 
perature since the Fermi-Dirac distribution is discontin¬ 
uous at T = 0. Another disadvantage is that the degree 
of the polynomial must scale as 0{f3cj), where (3 is the 
inverse temperature and a is the valence spectral width, 
to achieve a given accuracy. The occupied part of the en¬ 
ergy spectrum usually spans many eV and even tens of 
eV whereas room temperature corresponds to an energy 
of 0.025 eV - these very different energy scales demand 
very high order expansions. 

It is also possible to compute the density matrix from 


single particle Green’s function using the following for¬ 
mula: 

P=^/ dXG{X) (16) 

— oo 

where G = (A — H)~ as proposed by Baroni and P. 
GiannozzP^. Like polynomial expansions, rational ap¬ 
proximations generally require a lot of terms to achieve 
a decent accuracy. The issue has been addressed by 
Lin and coworkers who developed a multipole e xpan- 
sion method which scales as O log(/3cr) log(log(/3tT))pSESl^ 
Their method was combined with the par allel s elective 
inversion algorithm developed by Lin et to per¬ 

form electronic structur e calc ulations of systems compris¬ 
ing thousands of atom^^aMO] rational expansion of 

the Fermi-Dirac operator leads to an inverse intensive 
method which has some disadvantages compared with 
diagonalization techniques. It is usually more difficult to 
achieve a good load balance and memory distribution in 
inverse algorithms. Moreover, the complexity worsens as 
the dimensionality of the system increases from one di¬ 
mension (ID) to three dimensions (3D). This is due to 
the filling of the matrix factors which becomes problem¬ 
atic in 3D problems from both memory and processing 
perspectives. If the inverse is available, spectrum slicing 
and shift-and-invert eigensolvers may also be used to find 
a large number of eigenvectors efficientljE^. 

Having briefly reviewed the various algorithms in prac¬ 
tical KS-DFT implementations, we present a method 
which focuses on building a quasi-minimal basis set 
in the following subsection. The adaptive basis 
set is constructed via the Chebyshev filtering tech¬ 
nique fir st applied by Zhou, Chelikowsky, Saad and 
coworkerglSEniiffial^ workflow is presented in Algo¬ 
rithm]^ First, the basis set from the previous iteration 
is refined using Chebyshev filtering. A Chebyshev filter 
is an operator which is expressed as a first kind Cheby¬ 
shev polynomial of the Hamiltonian matrix. It is eas¬ 
ily evaluated as it only requires matrix-vector products. 
The size of the subspace is usually comparable to N and 
is thus significantly smaller than the size of fixed basis 
sets (e.g. Gaussian orbitals or atomic orbitals). It can 
also be systematically refined and its accuracy is only 
limited by the underlying grid resolution. The down¬ 
side is naturally that its computation comes at a cost 
and the resulting subspace is generally dense such that 
the memory requirement scales as 0{N‘^). The scheme 
is generally more costly than DFT calculations using or¬ 
bital bases but much lighter than the plain diagonaliza¬ 
tion schemes. RESCU implements both Algorithms]^ for 
atomic orbitals and Algorithm ]^ for real space grids. 

IV. CHEBYSHEV FILTERING 

In this section, we describe the Chebyshev filter¬ 
ing procedure introduced by Zhou et al. for KS-DFT 
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Algorithm 4 CFSI Kohn-Sham Solver 
procedure CFSISolver(5) 

Initialize po, H[po] 

while e > S or k < kmax do 

Compute a subspace using the 

Chebyshev filtering 

Compute the projected Hamiltonian H and the 

—k 

overlap matrix a 

Compute the projected density matrix P 
Compute the density p*'''^(r) = P*^(r,r) 

Compute H = H[pfc] 

Calculate e = \\pk - Pfc-i||, e = llH[pfc] - H[pfc_i]|| 
return pk+i 


calculationJi^ and its application in the RESCU method. 
Already in 2006, the simulation of a Si 904 iHi 86 o nan¬ 
ocluster was reported in Refl20] using this method plus 
computational acceleration by symmetry considerations. 
The technique concentrates on building a subspace $ 
which spans ^ as defined in Eq.([^. A suitable approxi¬ 
mation for '4' is generated and thereafter rotated toward 
only using Hamiltonian-subspace products. We de¬ 
note the subspace at the self-consistent iteration 
To illustrate how this works, we shall refer to Fig{T] The 
dimension of is L = N^c + Nfr, where N^c is the num¬ 
ber of fully occupied states {npo > 1 — e) and Nfr is the 
number of fractionally occupied states (1 —e > npu > 0). 

Consider a vector (p G Since the Kohn-Sham 

eigenvector basis is complete we can write 

4> = . (17) 

If we apply a spectral filter ^^(H) to p, we change the 
composition of the vector p in the following way: 

= (18) 

= '^aiTn{Xi)ilJi . (19) 

Suppose that Tn{Xi) ^ Tn{Xj) for i G {1,...,L} and 
j G {L + 1,..., M}, then Tn{H)(j) has a much larger over¬ 
lap with '4' than p. Applying such a filter to a whole 
subspace will result in a steering of toward '4'. 
For reasons evoked previously, we would like to avoid di- 
agonalization and inversion of the Flamiltonian so that 
polynomial filters are an evident choice. We seek a poly¬ 
nomial that assumes large values in the interval [Ai,Ai,] 
and small values in the interval [Al+i,Am]- Many poly¬ 
nomials satisfy this property but the Chebyshev polyno¬ 
mials of the first kind (denoted T„ where n is the degree) 
have the minimal oo-norm on the interval [—1,1] among 
monic polynomials, and they grow exponentially in the 
degree n outside [—1,1] making them the ideal candidate. 
The 8*^ order Chebyshev polynomial of the first kind Tg 
is plotted on a logarithmic scale in Fig. [1] The polyno¬ 
mial oo-norm is bounded by 1 in the interval [—1,1] and 
strictly greater than 1 outside the interval [—1,1]. 


r-00 



FIG. 1: 8*^ degree Chebyshev polynomial of the first kind 
Ts{x) as a function of x. The y axis is logarithmic. The green 
region is mapped to fully occupied Kohn-Sham states. The 
yellow region is mapped to the fractionally occupied states 
(mostly unoccupied states). The [-1,1] interval is mapped to 
the unoccupied Kohn-Sham states. |rg(x)| < 1 in the interval 
[-1,1] and hence applications of the Chebyshev filter suppress 
the unoccupied components. The interval [l,oo] is mapped 
above the spectrum and hence do not contribute. 


In general, the unoccupied energies do not correspond 
to the interval [—1,1]. In order to use T„ as a filter, we 
must apply an affine transformation which maps the in¬ 
terval [-00,-1] to [— oo,Al] and [—1,1] to [Ai+ijAyf]. 
In this way, the components of the occupied spectrum 
are assuredly magnified with respect to the components 
of the unoccupied spectrum. This requires the knowledge 
of the lower and upper bounds Xl and Am- The lower 
bound of the unoccupied spectrum Xl can be estimated 
from the largest Ritz value of which is easily obtain¬ 
able. The upper bound of the spectrum can be estimated 
from a few step of the Lanczos algorithm. Estimates for 
the eigenvalue errors are derived in Te mpl ates for the so¬ 
lution of algebraic eigenvalue problems^ and Zhou has 
studied the accuracy and robustness of a few estimators 
for the upper bound of the spectruir^. 

Now, consider a system of N electrons at zero tem¬ 
perature such that Noc = N. The rate of convergence is 
roughly T„(AAr) since |T„(Ai)| > |T„(Aiv)| > 1 > |r„(Aj)| 
for i G iV} and j G {L + The occu¬ 

pied part of the spectrum is thus magnified by at least 
Tri(AAr) at every filter application with respect to the 
unoccupied spectrum. However, if A a? ~ Ai+i then 
|7n(AAr)| > 1 > |T„(A 7 v+i)| and the convergence rate 
TniXN) — 1 is disappointing. It is thus crucial to include 
enough fractionally occupied states in the subspace to 
separate the occupied part of the spectrum and the un¬ 
occupied part of the spectrum. The fractionally occupied 
Kohn-Sham states converge slowly but it does not matter 
since they contribute little if at all to the electronic den¬ 
sity. The occupied, fractionally occupied and unoccupied 
sections of the spectrum are represented by the green, 
yellow and red region respectively in Figj^ The white 
region does not map to any state since the largest eigen¬ 
value is mapped below 1. A pseudocode detailing the 
application of a Chebyshev filter is found in Algorithm 

There, we take advantage of the fact that Chebyshev 
polynomials of the first kind obey the recursive relation 
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Tn+i{x) = 2xTn{x) — Tn-i{x). Alternatively, the co¬ 
efficients can be directly computed using the following 

formulcP 


Tn{x) 


n f n-2fc 

4 ; fc!(n-2fc)! 


( 20 ) 


which allows the filtering implementation to use one less 
temporary vector. 


Algorithm 5 Chebyshev Filtering 
procedure ChebFilter)?!, Am, Al, ipo, H) 

Compute the affine transformation parameters e = 
(Am — A_l)/ 2 and c = (Am -f A_l)/2. 
ipi = (Hi/Jo - ctpo)le 
for i = 2, ...,n do 

ip2 = 2(Hi/>i - - ipo 

tpo = ipi, V'l = V’2 
return %j }2 


If Nfr was equal to 0, we would only need to orthonor- 
malize the subspace and the projected density matrix 
P could be assumed to be the identity and therefore the 
density would then be trivial to evaluate. This is because 
the density is invariant under unitary transformations of 
the occupied Kohn-Sham subspace. However, using ex¬ 
tra Kohn-Sham states is generally necessary to obtain 
a robust convergence for reasons we just mentioned and 
the density must not include the contribution from the 
unoccupied states. For very large systems comprising 
tens of thousands of electrons, certain quantities such as 
the total energy may not be affected significantly by in¬ 
cluding a few unoccupied states in the density. Whether 
this is true depends on the problem and the precision 
target but in general Chebyshev hltering is followed by 
the Rayleigh-Ritz procedure in order populate the Kohn- 
Sham states correctly. The Rayleigh-Ritz procedure is 
summarized in Algorithm It is essentially a procedure 
that orthonormalizes the subspace and computes the 
projected density matrix. It scales as 0{N^) and it is 
the main bottleneck in large scale computations in the 
Chebyshev filtering scheme. Note that it is possible to 
first orthonormalize and then solve a standard eigen¬ 
value problem or use the non-orthonormal and solve 
a generalized eigenvalue problem. We have observed that 
the latter is generally slightly faster overall. 

The Chebyshev filtering technique introduced above 
can be used with orthonormal discretization schemes 
such as hnite-differencing and plane-waves. It may also 
be used on the projected Hamiltonian even in the case 
where the overlap matrix is not the identity. Implementa¬ 
tions using Chebyshev hlterin g with non-orthonormal ba¬ 
sis sets such as hnite-elementa^ZHU^ projector-augmented 
waves (PAW)P^ and full-potential linearized augmented 
planewaves (FLAPW)l^have been reported in the litera¬ 
ture. We will describe how the technique also benehts ba¬ 
sis sets methods such as NAO. As mentioned above, the 
NAO basis is static and localized. This leads to a sparse 


Algorithm 6 Rayleigh-Ritz procedure 
procedure RayleighRitz(H,$) 

Compute H = 
if A I then 
Compute S = 
else 

S = I _ 

Diagonalize HC = SCA 
Compute ^ 

Compute P = nFD(A,/i) 

return P 


representation for H and S but their size is not mini¬ 
mal such that the eigenvalue problem of the Rayleigh- 
Ritz procedure is rather large. The Chebyshev hltering 
technique can construct and maintain an eigensubspace 
which satishes, to a good approximation, Eq.(ll|. 


The matrix pencil H — AS cannot be used directly so 
that one must transform the generalized problem to a 
standard one. It is crucial to preserve to a point the 
sparsity of the matrix pencil since otherwise the required 
matrix operations do not have a signihcant advantage 
over dense diagonalization algorithm which are based on 
QR-decompositions. One option is to rewrite Eq.(ll) as 
follows: 


S ^HC = CA 


( 21 ) 


and to apply a hlter T„(S ^H). The operator S is 


no more Hermitian but in principle Eqs.(ll) and (21) 


are equivalent and this should pose no problem. One is¬ 
sue is that S is generally dense and hence the matrix- 
vector products are computationally costly. In quasi-lD 
or quasi-2D systems, the Cholesky factor U may still be 
relatively sparse depending on the system and the or¬ 
dering of the overlap matrix. In this case, we suggest 
considering Eq.(12) instead. The reduced Hamiltonian 

_ 'J' _ Y 

H = U HU can be used in the filter to compute 
which yields C^'. Each product then necessitates solv¬ 
ing two triangular systems of equations and one sym¬ 
metric matrix product. In conclusion, the efficiency of 
the Chebyshev filtering method vitally depends on the 
capacity to invert the overlap matrix in the context of 
NAO. There is so far no versatile and effective method 
to solve this issue. 


V. THE PARTIAL RAYLEIGH-RITZ 
PROCEDURE 

We begin this section by analyzing the computational 
complexity of the different operations performed during 
the self-consistent procedure in KS-DFT. 

Applying a Chebyshev hlter consists essentially in ma¬ 
trix products and the procedure scales as 0{MN + N^) 
where M is the size of the discrete Hamiltonian H 
(i.e. the number of grid points) and N is the num¬ 
ber of occupied Kohn-Sham states (recall L ~ N). 

















Procedure 

Scaling 

Compute # r„(H)$ 0{MN) 

Orthonormalize ^ 

0{MN^) 

Compute H and S 

o\mn^) 

Solve HC = SCA 

o\n^) 

Compute # $C 

oImn^) 


TABLE I: List of the most computationally expensive proce¬ 
dures in solving the Kohn-Sham equations and the associated 
computational complexities. M is the size of the Hamiltonian 
matrix and N is the number of occupied Kohn-Sham states. 


The 0{MN) scaling comes from applying the Lapla- 
cian and the 0{N'^) term comes from applying the 
Kleinman-Bylander projectors in dealing with the nonlo¬ 
cal pseudopotential^. 

Next, the Rayleigh-Ritz procedure scales as 0{MN'^ + 
N^) but its computational cost does not dominate until 
quite large system sizes as we shall demonstrate in Sec¬ 
tion |VII| We further decompose the complexity of the 
Rayleigh-Ritz procedure in the following four operations: 
subspace orthonormalization, Hamiltonian (and identity) 
projection, eigenvalue solution and computation of the 
Ritz vectors. The scaling of the most computationally 
expensive steps in solving the Kohn-Sham equations is 
displayed in Table The non-orthonormality of $ fol¬ 
lowing the Chebyshev filtering procedure can be taken 
into account by the Rayleigh-Ritz procedure, and there¬ 
fore the cost of orthonormalization can be absorbed in 
the diagonalization cost. The complexity for computing 
H and S, and computing $ := $C is 0{MN'^). The 
former is more computationally expensive since it is a 
{N X M) X (M X N) matrix product (where M ^ N) 
whereas the latter is a (Mx N) x {NxN) matrix product. 
We identify the computation of H and S, the eigenvalue 
problem HC = SCA and the computation of the Ritz 
vectors as the three principal bottlenecks in a large scale 
KS-DFT computation. 

The obvious way to address the first and third bottle¬ 
necks in Table |l] is to construct a localized basis for 
which leads to a sparse matrix representation. In atomic 
orbital methods, is sparse but has known limitations 
in accuracy due to the inflexible nature of the basis set. 
In addition, even if ^k-i is localized, is not sparse in 
general since the Chebyshev filtering procedure fills in the 
matrix. In Ref fiSl Motamarri et al. use the localization 
technique introduced by CerverrP^ to build a localized 
basis for the Chebyshev filtered subspace and the efh- 
ciency relies on the possibility to maintain a sparse basis. 
Their work shows that finite-elements are most appropri¬ 
ate to exploit the advantages of both Chebyshev filtering 
and basis localization. This is not possible if high-order 
finite-differences or plane waves are employed to compute 
the derivatives. When using high-order finite-differences, 
the density of the subspace matrix increases rapidly with 
the degree of the Chebyshev filter due to the far reaching 
high-order stencils. 

Here, we address the second bottleneck in Table by 


showing that it is unnecessary to fully diagonalize H in 
order to populate the Kohn-Sham states correctly. Sup¬ 
pose that Noc Kohn-Sham states are fully occupied and 
that Nfr are fractionally occupied. We claim that only 

the Nfr largest eigenpairs of H are actually required 
for the KS-DFT, and this method is named the partial 
Rayleigh-Ritz (pRR) procedure. To see this, consider 


c = 

CocCfr 

A = 

' Aoc 0 


0 A 


( 22 ) 

(23) 


where C is the matrix of eigenvectors of H. Then 


npoi^) 



0 

npoi-^fr) 


= 1 + 


0 0 

0 nFpt{A.fr) I/rac 


(24) 

(25) 


where Iqc/ fr is a ^ identity matrix. Using 

the last equation and the fact that C is unitary, Eq.(15) 
can be transformed into 


P = U 

V-i 


'( 


I -I- Cfr [nFoi^fr) — I/r] Cj,, 
“t” {nFoi-^fr') I/r) 


-T 

u 


(26) 

(27) 


From the last equation, it appears that only the Nfr 
largest Ritz-values are required to evaluate the Fermi- 
Dirac operator. The density matrix is the inverse of 
the overlap matrix plus a rank-Aj^ correction in which 

the largest eigenvectors of H appear. In a large sys¬ 
tem, Nfr is generally much smaller than Noc and is more 
of less constant with respect to the system size. For 
example, our tests show that Nfr ~ 8 — 32 whereas 
Noc ~ 8, 000 — 12, 000. Since only a few eigenvectors 
are required, iterative eigensolvers can be used to com¬ 
pute Cfr- Like the Rayleigh-Ritz algorithm, this partial 
Rayleigh-Ritz procedure works whether the subspace $ 
is orthonormal to begin with or not. It is generally faster 
to orthonormalize it inside the Rayleigh-Ritz procedure 
and not ahead of it. The partial Rayleigh-Ritz procedure 
is summarized in Algorithm]^ 


Algorithm 7 Partial Rayleigh-Ritz procedure 
procedure PartialRayleighRitz(H,$) 

Compute H = 
if / I then 
Compute S = 
else 

S = I ^ 

Diagonalize HC/^ = C/rA/^ 

Compute ^ = #U 

Compute P = I -I- Cfr {upniJ^fr) — I/r) 

return P 


To close, the partial Rayleigh-Ritz algorithm differs 
from the traditional Rayleigh-Ritz algorithm in two 
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ways from a computational perspective. Firstly, in the 
Rayleigh-Ritz algorithm, the current Chebyshev filtered 
subspace is multiplied by the eigenvectors of the pro¬ 
jected eigenvalue problems, a general matrix. In the 
partial Rayleigh-Ritz algorithm, the current Chebyshev 
filtered subspace is multiplied by the inverse of the 
Cholesky factor of the overlap matrix, a triangular ma¬ 
trix. In the case where the Chebyshev filtered subspace 
is already orthonormal, nothing needs to be done. This 
halves the computational cost associated with updating 
the subspace. In exact arithmetic, this could even be 
avoided altogether, but in practice the basis vectors of 
the subspace become linearly dependent and they 
must be periodically orthonormalized. Whether this is 
necessary can be monitored by looking at the condition 
number of the overlap matrix. Another distinction is 
that only a few eigenvalues and eigenvectors of the pro¬ 
jected eigenvalue problem are necessary to build the one- 
particle density matrix and obtain the electronic density 
for the KS-DFT. This reduces the computational com¬ 
plexity associated with the diagonalization from 0{N^) 
to - C>(iV 2 ). 

Furthermore, we argue that the partial Rayleigh-Ritz 
algorithm is easier to parallelize than the Rayleigh-Ritz 
algorithm. Parallelizing the eigenvalue problem Eq. 0 
is tantamount to re-implementing ScaLAPACK routines 
or some other linear algebra library for distributed mem¬ 
ory computers, a daunting task physicists wish to avoid. 
In contrast, in pRR it suffices to parallelize matrix prod¬ 
ucts of the type Hit to parallelize the eigenvalue problem 
in Algorithm The parallel matrix-vector routine can 
then be used in a partial diagonalization algorithm such 

as lobpcgP. 


VI. THE RESCU IMPLEMENTATION 


In this section, we report the implementation of the 
KS-DFT solver RESCU. The code is written in MAT- 
LAB and includes interfaces to MPI libraries, ScaLA¬ 
PACK, CUBA, cuSPARSE and LibXC written in C and 
compiled into MEX-files. 

The Kohn-Sham equation is discretized on a uniform 
Cartesian grid. High-order finite-differencing is used to 
discretize the differential operators. We generally use 
0{h^^) stencils (49-point stencils) where h is the grid 
resolution. Beyond that the Vandermonde system of 
equations determining the stencil coefficients becomes ill- 
conditioned. Moreover, the entries delimiting the stencil 
become negligibly small and there is no gain in trying to 
increase the accuracy further. Matrix representations of 
the first and second order differential operators are gen¬ 
erated for each coordinate. Differencing matrix-vector 
products take the following form 


p)(n) _ / N . 

(28) 

dx\ I 


Here, the derivative is taken with respect to the first 
coordinate. More generally, is a order discrete 
differential operator operating along the coordinate 
and xl is the grid point along the coordinate. Such 
products can be implemented as matrix-matrix products 
by making the array / into a matrix in which the first 
dimension runs over the differentiated coordinate and the 
second dimension the other coordinates. iV-dimensional 
finite-difference operators are Kronecker products of the 
form 




dx 


(") 


1-1 


i=l 


N 

(g) I. 

i=j + l 


(29) 


where 1 ^ is the identity operator along the i*^ dimen¬ 
sion. The matrix representation of the operator defined 
in Eq.(29) need not be built explicitly. It suffices to per¬ 


form certain manipulations on the function array, array 
dimension permutation and transposition for instance, to 
make equation |28| into a matrix product. We implement 
gradients and Laplacians applications as these particular 
Kronecker products as we found this was most efficient. 
We have compared it against using the multi-dimensional 
(sparse) Laplacian, using the stencils directly and using 
Eourier transforms among others. Eor large systems it 
may be advantageous to put the differencing matrices 

j)(") 

in a sparse format. A sparsity of roughly 0.05 was 
observed to be the turning point. Eor example, for an 
0{h^^) stencil it would be advantageous to use sparse 
differential operators if the number of points along a di¬ 
mension is larger than 300. 

We use a pseudopot ential set generated from the 
Troullier-Martins schem^^sMI ^§0 |;]^0 Kleinman- 
Bylander representatioiP^ to model the atomic cores. A 
core correction is added as prescribed in Ref. [^for ele¬ 
ments in which the core shells overlap significantly with 
the valence shell. The set was developed for the NAO 
quantum transport package NanodcaP^ and it includes 
double-zeta polarized atomic orbitals. The Hartree po¬ 
tential of the spherically symmetric valence atomic or¬ 
bital charge is added to the pseudopotentials, and hence 
screens long range Coulomb tails. Corrections to the 
Hartree potential are calculated by solving the Poisson 
equation for the deviation from the neutral atom density. 
Eourier transforms are used in periodic systems and sine 
transforms are used to diagonalize the finite-difference 
Laplacian in Dirichlet problems. 

Our software implements the time -saving double-grid 
technique of Tomoya and KikujP^P^. It is used to com¬ 
pute certain integrals, such as the projection of the 
Kohn-Sham states onto the Kleinman-Bylander projec¬ 
tors, at a lower cost. The Kohn-Sham states are typi¬ 
cally smoother than the pseudopotentials. The idea is to 
express the Kohn-Sham states on a coarse grid and inter¬ 
polate them on a finer grid used for the pseudopotenials 
when needed, saving memory and time as a result. A few 
exchange-correlation functionals have been implemented 
in MATLAB: PW9#^ (LDA), PBEP (GCA) and MB^Hl 
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(mGGA). More exchange and correlation functionals are 
available from LibXCpS. The interface from MATLAB 
to LibXC is written in G and allows us to use most LDA, 
GGA and meta-GGA functionals implemented in the li¬ 
brary. 

The Ghebyshev filtering technique originally proposed 
by Zhou et al. is significantly impeded by the initial 
subspace generation which requires solving for all Kohn- 
Sham eigenstates. In a recent paper, the authors show 
that starting from a random subspace and performing a 
few filtering steps is sufficient to obtain a suitable initial 
subspac^^. In our implementation, yet another option 
is available: a single- or double-zeta atomic orbital basis 
is used as an initial subspace. The lowest energy levels 
of that subspace are found by partly diagonalizing the 
projected Hamiltonian and then a quasi-minimal initial 
subspace is constructed. Moreover, it is possible to reach 
convergence to a prescribed accuracy in the NAO basis 
before transposing the calculation to real space. This 
provides a more robust and quick convergence and alle¬ 
viate significant computational cost. Our tests show that 
even using a single-zeta atomic orbital basis can generally 
give a good initial subspace. 

Gertain elements with d and / electrons have hard 
pseudopotentials and the resolution of the grid must be 
increased accordingly. It is generally easier to perform 
calculations at a low resolution since the convergence 
rate of non-linear accelerators generally deteriorates with 
respect to system size as shown by Lin and coworkers 
in RefI551 We have implemented a “multi-grid” Kohn- 
Sham solver: it solves the equation on a coarse grid first 
and gradually refines the grid until some target resolu¬ 
tion is reached. After completing the calculation at a 
given resolution, the code interpolates the pseudopoten¬ 
tials and neutral atom electronic density and interpolates 
the current approximation of the Kohn-Sham subspace 
on the refined grid. The density is then updated and a 
new self-consistent cycle is initiated. We implemented a 
few mixing schemes to accelerate the convergence of the 
density or the effective potential. In particular, we im¬ 
plemented Broyden mixing as proposed by Srivastava in 
RefIMl and Johnson mixinj^as presented by Kresse and 
Furthmiiller in Ref. [T^ 


The parallelization is done with MPI and ScaLA- 
PACK. MATLAB also naturally takes advantage of In¬ 
tel’s MKL threading capabilities. The MPI-only imple¬ 
mentation is based on the MPI-|-ScaLAPAGK implemen¬ 
tation, and hence we refer to ScaLAPAGK in the descrip¬ 
tion that follows. Our implementation uses a 2D block 
cyclic distribution to scatter the arrays across many pro¬ 
cesses. This is a quite general distribution scheme as far 
as matrices are concerned. The distribution depends on 
four parameters: two specifying the size of the blocks, 
two specifying the size of the process grid. Some ex¬ 
amples of a 60 X 10 matrix shared between 4 processes 
are shown in Fig. ^ Fig. 2(a) shows a matrix split 


in 10 X 10 blocks distributed on a 4 x 1 process grid, 
which we refer to as a tall process grid. In contrast, a 




(a) Tall (b) Flat (c) Square 

process process process 

grid grid grid 

FIG. 2: The 2D block cyclic distribution is a general scheme 
used to distribute arrays in RESCU. The array distribution 
for different block sizes and process grids is depicted. The 
numbers indicate the rank of the process holding the subma¬ 
trices. 


flat process grid is distributed as shown in Fig. |2(b)[ 
where the blocks are 60 x 2 submatrices. Finally, a ma- 
trix distributed on a square process grid is depicted Fig. 
|2(c)[ where the blocks are 4x4 submatrices. The block 
size must be chosen large enough to limit communication 
between processes but small enough to yield a good load 
balance. The performance is not that sensitive to block 
size in our experience; a block size between 16 and 256 
is usually efficient on an InfiniBand network. It is much 
more sensitive to the shape of the process grid however. 
For certain operations, the computational time can vary 
as much as 100%. It is more often than not favorable 
to take the time to redistribute optimally an array using 
PDGEMR2D before performing an operation. In the fol¬ 
lowing description, we indicate which one of a tall, flat or 
square process grid is likely to yield the best performance. 


The load associated with interpolating spatial vari¬ 
ables such as potentials, densities and atomic orbitals 
is distributed according to a spatial partitioning which 
corresponds to a tall process grid (Fig. 2(a)). A good 
load balance is achieved since the number of real space 
grid points is always by far superior to the number of 
processes and the communication cost is negligible since 
interpolation is a local operation. 


A significant computational cost is associated with 
Hamiltonian-wavefunction products. Such products oc¬ 
cur in the Lanczos solver which is used to compute the 
eigenspectrum upper bound (see Section IV). The cost 
associated with the Lanczos solver is marginal but it can 
be parallelized by using ScaLAPAGK with a few (e.g. 
four) processes or by threading the operations. Filtering 
the Kohn-Sham subspace is computationally expensive in 
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comparison. We found that parallelizing over the Kohn- 
Sham states is often the most efficient approach since 
the processes do not need to communicate during the fil¬ 
tering procedure. We thus distribu te the subspace 
array using a flat process grid (Fig 2(b)) prior to that 
operation. This is adequate as long as the number of 
electrons in the simulated system is larger than the num¬ 
ber of processes. It is more difficult to achieve a good 
load balance in large systems with relatively few elec¬ 
trons such as ‘hollow’ molecules and 2D crystals. In this 
case, a parallelization scheme as described in RefMH will 
likely perform better. One other option is to use less pro¬ 
cesses but use threading to parallelize the Hamiltonian- 
wavefunction products. 


Another computational bottleneck comes from or- 
thonormalizing the Kohn-Sham subspace. Cholesky or¬ 
thonormalization can be used, but it is performed by 
the more robust QR factorization routines PDGEQRF 
and PDORGQR if required. We have observed that ma¬ 
trix decomposition and diagonalization routines gener¬ 
ally perform best on arrays split in squ are bl ocks and 
distributed on a square process grid (Fig |2(c)[ ). Gonse- 
quently, we redistribute the subspace as it yields a good 
performance in our tests. A significant cost is associ¬ 
ated with calculating the projected Hamiltonian and the 
overlap matrix. A tall process grid offers the best per¬ 
formance for this operation when using MPI or ScaLA- 
PACK. The Rayleigh-Ritz procedure is completed by 
calling PDSYEV or PDSYGVX depending on whether 
an orthonormalization of the subspace has taken place 
before. In the Rayleigh-Ritz procedure, PDSYGVX finds 
all the eigenvectors. If the partial Rayleigh-Ritz proce¬ 
dure is used, then PDSYEVX or PDSYGVX is invoked 
to find the few required eigenvectors. We have also mod- 
ihed the ARPAGK!^ interface provided with MATLAB 
and Knyazev’s MATLAB implementation of LOBPGCP^ 
to use the parallel matrix-vector products mentioned at 
the end of section El 


VII. NUMERICAL TESTS 


Our numerical tests are performed at McGill Univer¬ 
sity’s Gentre for High Performance Gomputing (HPG). 
We use nodes that consist of two Intel E5-2670 processors 
(8-core, 2.6 GHz, 20MB Gache) and 128 GB of DDR3 
memory. The internode communication link is Infini¬ 
Band QDR. We use OpenMPI 1.8.3 and ScaLAPACK 
2.0.2. In our tests, we set the number of processes equal 
to the number of cores and turn off threading. The tim¬ 
ings reported below are wallclock times. The results for 
some of the largest systems are compiled in table [Til 


A. Real Space RESCU 

1. Partial Rayleigh-Ritz 

As mentioned earlier, the partial Rayleigh-Ritz algo¬ 
rithm is most competitive when the cost of the diagonal¬ 
ization taking place in the Rayleigh-Ritz procedure be¬ 
comes significant. We evaluate the potential gain associ¬ 
ated with diagonalization by benchmarking the ScaLA- 
PAGK routines PDSYEV, PDSYEVX and our parallel 
version of ARPAGK, which we call “ScaARPAGK” here. 
We seek the 16 largest eigenvalues of random matrices of 
varying sizes. This is a typical number of buffer states re¬ 
quired in the partial Rayleigh-Ritz algorithm in the sim¬ 
ulation of gapped systems (e.g. Si). The time is averaged 
over 10 randomly generated symmetric matrices for each 
size. The ScaLAPAGK routines use an 8 x 8 process grid 
and 32 x 32 blocks. ScaARPAGK uses a homemade func¬ 
tion that carries out the parallel matrix-vector products. 
The matrix is scattered according to a 64 x 1 process 
grid and 16 x 16 blocks. We have diagonalized matrices 
up to linear size 16,384, 32,768 or 65,536 with PDSYEV, 
PDSYEVX or ScaARPAGK respectively. The results are 
plotted in Fig. i The scaling is C>(V^'^) for the ScaLA¬ 
PAGK routines and almost linear for ScaARPAGK. This 
is better than theoretical asymptotic scaling in all cases. 
This reflects the importance of the communication cost in 
the case of ScaLAPAGK and the large overhead of our im¬ 
plementation in the case of ScaARPAGK. Both ScaLA¬ 
PAGK routines share roughly the same scaling but the 
routine PDSYEVX is about an order of magnitude faster 
than PDSYEV since it stops when the wanted eigen- 
pairs have been found. ScaARPAGK starts winning over 
PDSYEVX when the matrix to be diagonalized is larger 
than 8, 000 x 8, 000. The speed-up for the diagonaliza¬ 
tion becomes substantial when the number of electrons 
is equal to or greater than 32,000 (after accounting for 
spin degeneracy). 


2. Subspace Initialization 

Next, we demonstrate the importance of the quality of 
the initial subspace on convergence when using Cheby- 
shev filtering in lieu of an eigensolver. In Ref. [501 the 
authors suggest starting from a relatively accurate set of 
eigenvectors of the Hamiltonian. It was later demon¬ 
strated that using a few Ghebyshev accelerated steps 
of the power method on a random initial subspace was 
sufficient to simulate many systems more efficientljl^. 
We thus compare the atomic orbital (NAO) initialization 
against the Ghebyshev filtering initialization as described 
in RefiOOl Single-zeta atomic orbitals are interpolated on 
the real space grid and the Rayleigh-Ritz procedure is 
performed using the resulting subspace to obtain the ini¬ 
tial subspace. In that sense, we are performing a one shot 
NAO calculation and then we project the result on the 
real space grid. For the Ghebyshev filtering initialization, 
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System 

# of atoms (e ) Nx Ny N^ 

Subspace (L) Method ^ 

: cores 

Time (hrs) 

Si 

5,832 (23,328) 

140 140 140 

11,672 

RS 

256 

5.52 

A1 

4,000 (12,000) 

110 no no 

8,044 

RS 

64 

5.09 

A1 

8,788 (26,364) 

141 141 141 

17,596 

RS 

256 

23.88 

Gu 

1,372 (15,092) 

156 156 156 

8,058 

RS 

256 

9.12 

DNA-H 2 O 

5,399 (14,596) 

170 168 148 

7,314 

RS 

256 

9.62 

Si 

13,824 (55,296) 

247 247 247 

55,296 

AO 

64 

6.43 

Gu 

5,324 (58,564) 

267 267 267 

95,832 

AO 

256 

13.42 


TABLE II: Some of the largest physical systems solved by KS-DFT with RESCU. The number of electrons in the system is 
indicated in the parentheses beside the number of atoms. The vector [Nx, Ny, N^] gives the numbers of points used along each 
dimension. For the AO method, [Nx, Ny, N^] is the size of the real space grid used to project the orbitals and calculate the 
density. It is also the grid by which the Poisson equation is solved for the Hartree potentials. The subspace dimension L 
corresponds to the linear dimension of the eigenvalue problem. In the method column, RS stands for real space and AO for 
numerical atomic orbital. The time is the total wall-clock time to converge the entire KS-DFT computation. More details 
about the computation are found in section ivnl 




FIG. 3: Time as a function of matrix size for full and par¬ 
tial diagonalization (16 eigenpairs). The timings for the 
ScaLAPACK routines PDSYEV and PDSYEVX, used in the 
Rayleigh-Ritz and partial Rayleigh-Ritz algorithms, are rep¬ 
resented by the circles and crossed circles respectively. The 
timings for the parallelized ARPACK function (ScaARPACK) 
are denoted by the squares. 


4 Chebyshev filtering steps are used. 

In order to compare the methods, the density of a 
unit cell comprising 216 silicon atoms is calculated. The 
number of iterations to convergence as a function of 
the Chebyshev filter degree (CFscf) used in the self- 
consistent loop is plotted in Fig. We use 16 buffer 
states, Broyden acceleration with a mixing fraction of 
0.3 and the convergence criteria are < 10“^, 

j < 5 X 10“®. The number of iterations to con¬ 
verge the density for CFq = 16 and CFscf = 8 is missing 
as the density did not converged within 100 iterations. 
CFscf = 8 also yields the worst performance for all ini¬ 
tialization methods. This illustrates that the robustness 
is partly determined by the degree of the Chebyshev fil- 


FIG. 4: Number of iterations to converge the density of a unit 
cell containing 216 Si atoms as a function of self-consistent 
Ghebyshev filter degree for different initialization techniques. 
GFo stands for the degree of the Chebyshev hlter used in the 
initialization of the subspace (4 filtering steps are used). 
NAO stands for initialization from the solution of a single-zeta 
atomic orbital basis. The total number of Hamiltonian-wave 
function products is written by the data points. The circle, 
“plus” circle and “times” circle marks are for CFq = 8, GFo 
= 16 and CFq = 24 respectively. The square marks are the 
the NAO initialization. 


ter. NAO initialization leads to a faster convergence for 
all values of CFscf except for CFscf = 8 in which case 
the CFq = 24 initialization leads to the smallest iteration 
count. However, the number of Hamiltonian-subspace 
products remains superior to the NAO case as indicated 
beside the marks in Fig. Otherwise, the degree of 
the filter in the initialization step does not seem to im¬ 
pact convergence much in the present test. For CFscf 
larger than 12, the number of self-consistent steps to con¬ 
verge stagnates and the number of Hamiltonian-subspace 
products increases more of less linearly accordingly. Us- 
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FIG. 5: Time per self-consistent step as a function of the 
number of Si atoms. The scaling with respect to the number 
of atoms is approximately 0{N^'^). The triangles, squares 
and circles represent data points for calculations performed by 
16, 64 and 256 cores respectively. The solid and dashed lines 
are for the Rayleigh-Ritz algorithm and the partial Rayleigh- 
Ritz algorithm. For the 256-core results, the numbers by the 
data points (red open circles) are the total wall-clock time for 
converging the entire KS-DFT run. 


ing NAO initialization, the inflation of the number of 
Hamiltonian-subspace products is not as important be¬ 
cause the number of self-consistent steps keeps decreasing 
although not enough to compensate the cost of a high- 
degree filter. The optimal filter degree remains 12 for 
all methods for this system. We thus recommend using 
NAO initialization as it may converge faster and it is ap¬ 
preciably cheaper than Chebyshev filtering initialization 
in large systems by virtue of the localized character of 
the atomic orbitals (see Fig. [^and Fig. |^. The initial 
subspace can also be improved by using a multiple-zeta 
basis. 


3. Bulk Silicon Supercells 

Having tested various mathematical procedures, we 
now turn to physical systems of interest. Unit cells of sil¬ 
icon of varying size are simulated to test the performance 
of RESCU, the partial Rayleigh-Ritz algorithm and the 
Rayleigh-Ritz algorithm. The states are assumed to be 
spin-degenerate and 8 buffer states are included to sep¬ 
arate the occupied and unoccupied spectra as explained 
in Section EYi Kleinmann-Bylander projectors up to an¬ 
gular momentum L = 1 are used. The grid spacing lower 
bound is set to 0.66 Bohr which corresponds to an energy 
cutoff of 300 eV. The differential operators are generated 
using 16*^ order stencils. The exchange-correlation terms 
are computed using Perdew and Wang’s version of LDAI^ 


as implemented in LibXC. We use 15 steps of the Lanczos 
algorithm to calculate the eigenspectrum upper bound 
and a Chebyshev filter of degree 16. Broyden mixing with 
a mixing fraction of 0.3 and a history of 20 is employed 
to accelerate convergence. The convergence criteria are 
again < 10“®, < 5 x 10“®. For the 

partial Rayleigh-Ritz benchmark, we use the partial di- 
agonalization capabilities of ScaLAPACK (PDSYGVX) 
and we find 12 hole eigenvalues (i.e. 8 buffer states plus 
4 occupied states). Many of these parameters can be 
optimized further to yield a better performance: the pa¬ 
rameters used here are by no means optimal - we sim¬ 
ply used sensible values based on experience - but they 
already give impressive performance. Although this is 
suboptimal, we also keep the parameters constant while 
varying the number of processes to get a consistent and 
fair comparison. 

In Fig. the time per self-consistent step is plotted as 
a function of the number of atoms. The calculations were 
carried out using 16, 64 and 256 cores. We could go up to 
5,832 Si atoms before running out of memory (the next 
cubic Si supercell contains 8,000 Si atoms). The total 
time to convergence (in seconds) for the 256-core runs is 
written by the data points in Fig[^ We observe that, at 
to those sizes, the partial Rayleigh-Ritz algorithm leads 
to marginal gains. There are many reasons to this. Most 
importantly, the pRR gains are masked by the large cost 
of projecting the Hamiltonian into the hltered subspace 
and computing the overlap matrix. Hence, even though 
pRR is always faster (see Figj^, it gives marginal gains 
for this particular test. We shall discuss this issue in 
more details in section |VHI| We stress that pRR is mea¬ 
sured against the highly efficient parallel linear algebra 
library ScaLAPACK. In the case where no such library is 
available, pRR provides a convenient way to parallelize 
the computation of the projected density matrix and can 
lead to substantial time savings in smaller physical sys¬ 
tems. Finally, we note that the computational time scales 
consistently as for all processor counts. We also 

highlight that the parallelization efficiency approaches 
100 % as the number of atoms in the system increases. 

4- Bulk Aluminium Supereells 

We now turn to a metallic system for our second test: 
aluminium supercells. Again, we perform the calcula¬ 
tions using 16, 64 and 256 cores. The number of valence 
electrons per atom is 3 and the system is assumed to 
be spin-degenerate. The spatial resolution is 0.7 a.u. 
and the differential operator accuracy is 0{h^^). We 
compute the states occupancies using the Fermi-Dirac 
distribution with a temperature of 1,000K. The compu¬ 
tation is performed within the LDA using the routines 
XC_LDA_X and XC_LDA_C_PW from LibXC as in the 
previous benchmark. We chose a Chebyshev filter of de¬ 
gree 16. The number of buffer states varies with respect 
to system size. This is necessary to open a sizeable gap 
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64 256 1,024 4,096 


Number of atoms 

FIG. 6: Time per self-consistent step as a function of the 
number of A1 atoms. The scaling with respect to the number 
of atoms is approximately 0{N‘^). For the 256-core results, 
the numbers by the data points (red open circles) are the total 
wall-clock time for converging the entire KS-DFT run. 


between the occupied and unoccupied states as explained 
in section |IVj In the present test, we set the number of 
extra states to 10%-30% the number of ions. Finally, 
we used Johnson-Kerker mixing with a mixing fraction 
of 0.5 and a minimal mixing fraction of 0.05. The time 
per self-consistent step as a function of the number of 
A1 atoms is displayed in Fig. The computational 
cost per self-consistent step scales almost quadratically 
(0(N^-^)) with respect to the number of atoms. With 
64 cores, the largest supercell simulated contained 4,000 
A1 atoms and the density and total energy where con¬ 
verged to one part in 10® in slightly over 18,000 seconds 
using 256 cores. A supercell containing 8,788 A1 atoms is 
handled with 256 cores. After 29 iterations and almost 
24 hours of computation, the residuals are the following: 


\\Pk pfc —ill ^ 1 n 
N — -LU 


—4 ||£'fc—-£^fc-i|| 


~ 5 X 10- 


^ ||i;,+i;,_d| - - —- . The density 

convergence criterion is not as restrictive as one used in 
the other benchmarks. We note, however, that it is al¬ 
ready constrained enough for band structure calculation 
purposes as shown in Fig|9(c) below. 


5. Bulk Copper Supercells 

We briefly report input parameters and results for an¬ 
other metal test: copper supercells. In this test we used 
256 cores. The number of valence electrons per atom is 
11 and the system is assumed to be spin-degenerate. The 
spatial resolution is 0.3 a.u. and the differential operator 
accuracy is 0{h^^). We compute the states occupancies 
using the Fermi-Dirac distribution with a temperature 
of lOOK. The computation is performed within the LDA 
using the routines XC_LDA_X and XC_LDA_C_PW from 



FIG. 7: DNA molecule solvated in 1,713 water molecules. 


LibXC as in the previous benchmark. We chose a Cheby- 
shev filter of degree 16. In the present test, we set the 
number of extra states to 50% the number of ions. Fi¬ 
nally, we used Johnson-Kerker mixing with a mixing frac¬ 
tion of 0.25 and a minimal mixing fraction of 0.1. The 
largest supercell simulated contained 1,372 Cu atoms and 
the density and total energy where converged to one part 
in 10® in 32,843 seconds using 256 cores as reported in 
table mi 


6. DNA molecule in water 

As another example of testing on physical systems, 
RESCU is applied to calculate the electronic structure 
of a solvated DNA structure (5’-AAAA-3’) which is a 
completely disordered system. The initial structure is 
obtained from minimizing the structure with the molec¬ 
ular modeling package AMBER 11. The DNA structure 
is initially charge neutralized with counterions by 6 Na"*" 
and solvated with 1,713 TIP3P water moleculeJ^. The 
system is depicted in Pig. [71 The tetragonal simulation 
domain has dimensions 44.5 x 44.4 x 39.lA and periodic 
boundary conditions. We use a resolution of 0.25 Bohr. 
A total of 5,399 atoms (14,596 electrons) were simulated 
using 256 cores in the setup described above. We use 
the LibXC XC_GGA_X_OPTPBE_VDW exchange func¬ 
tional and XG_GGA_G_OP_B88 correlation functional. 
The Laplacian is discretized using 0{h^^) stencils. We 
used a 16*^ order Chebyshev filter and 16 buffer states. 
The convergence criterion is < 10“® and we 

mix the density using the Pulay method with a mixing 
fraction of 0.1. We initialize the subspace using a single- 
zeta atomic orbital basis set with an angular momentum 
cutoff L = 1. The electronic density converged in 20 
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FIG. 8: Time per self-consistent step as a function of the 
number atoms. The circles are for the Cu benchmark which 
is performed using a double-zeta atomic orbital basis with 
angular momentum cutoff L — 2 and 256 cores, each having 
8GB of DDRS memory. The squares are for the Si benchmark 
which is performed using a single-zeta atomic orbital basis 
with angular momentum cutoff L = 1 and 64 cores, each 
having 4GB of DDRS memory. The scaling with respect to 
the number of atoms is linear for smaller systems 

and gradually ramps up to cubic The numbers by 

each data point is the total wall-clock time for converging the 
entire KS-DFT run. 


steps which took a total of 34,638 seconds for an average 
time of 1,732 seconds per self-consistent step. Finally, we 
find that the gap between the highest occupied molecu¬ 
lar orbital (HOMO) and the lowest unoccupied molecu¬ 
lar orbital (LUMO) shrinks to 0.6eV - the gap for the 
isolated DNA structure without water is 2.0eV - which 
indicates that the solvent plays an important role in the 
optical properties of wet DNA^^. We shall present the 
comparison in a forthcoming articl^^. 


B. Atomic Orbital RESCU 

To demonstrate the capabilities of the atomic orbital 
method implementation in RESCU, we perform the same 
benchmark as above (bulk Si supercells) using numerical 
atomic orbitals. We use the same processors (E5-2670) 
equipped with half the memory this time (4GB/core). 
We perform the benchmark with 64 cores only. We also 
use a real space resolution of 0.5 a.u. A single-zeta basis 
with angular momentum cutoff L = 1 was used such that 
there are 4 atomic orbitals per atom. The results are plot¬ 
ted as blue squares in Eig. The largest simulated Si 
supercell comprises 13,824 Si atoms and the KS-DFT was 
converged in 23,161 seconds. Even at that size, we did 
not run into any memory issue since the atomic orbital 
basis yields sparse matrices. For relatively modest su¬ 


percells, RESCU scales linearly. The scaling deteriorates 
as the supercell size grows since the eigenvalue problem 
accounts for a larger and larger proportion of the compu¬ 
tational cost. The scaling gradually becomes cubic since 
we are treating the projected eigenvalue problem as a 
dense eigenvalue problem. Many methods evoked in Sec¬ 
tion [m] could improve the efficiency of NAO calculations 
further. Since diagonalization performance is so impor¬ 
tant in this test, we mention that a square processor grid 
and 64x64 blocks are used in the block cyclic distribution 
of the Hamiltonian and overlap matrices. 

We have also performed NAO computations for bulk 
copper supercells. For this benchmark, we use 256 cores 
with 8GB of memory per core. We use a real space res¬ 
olution of 0.25 a.u and a double-zeta basis with angular 
momentum cutoff L = 2 (18 atomic orbitals per atom). 
We have simulated supercells including up to 5,324 Cu 
atoms (58,564 electrons). The density and total energy 
were converged in roughly 12 hours (33 iterations). The 
time per self-consistent step as a function of the num¬ 
ber of Cu atoms is indicated by the red circles in Fig. 

The number of atomic orbitals per atom is 4.5 times 
larger than in the Si benchmark and the number of cores 
4 times larger. Consequently, the time per step for the 
Cu system is roughly an order of magnitude larger than 
the time per step for the Si system. The time per itera¬ 
tion scales similarly, i.e. it goes from linear in the 1,000 
atom range to cubic in the 10,000 atoms range. The total 
time scales worse here as larger (metallic) systems tend 
to take more iterations to converge. 


C. Accuracy 

To verify the accuracy of the electronic structures cal¬ 
culated in our benchmark, we calculated the band struc¬ 
tures of Al, Si and Cu using the VASP package (albeit us¬ 
ing a primitive cell) and compared it with the band struc¬ 
tures calculated from the density of the largest super¬ 
cell simulated. The overlayed RESCU and VASP band 
structures are displayed in Fig|^ The agreement is im¬ 
pressive considering that these methods are different in 
many respects. In particular, the PAW method is used 
in VASP and pseudopotentials are used in RESCU. We 
have also calculated band structures of compounds by 
both RESCU and VASP, results for the semiconductor 
GaAs and the insulator MgO are presented in Fig. 
Again, the agreement between the two methods is ex¬ 
cellent. This demonstrates the precision of the RESCU 
method for band structure calculations. 

A systematic approach to comparing DFT codes has 
been introduced by Lejaeghere et al.^^ in 2014. Its 
essence is to calculate the A-functional (see Eq|^ below) 
for the total energy versus volume (E vs V) equation of 
states (EOS) of elemental crystals. The equilibrium vol¬ 
ume Vo, the bulk modulus Bq and the derivative of the 
bulk modulus Bi can be extracted from the EOS using 
a third-order Birch-Murnaghan filP^. The value of A is 
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FIG. 9: Comparisons of band structures obtained by VASP 
and RESCU. The band structures agree to a high accuracy 
indicating that the resolution used in our benchmark was suf¬ 
ficient for the purpose of band structure calculations. 

thus a sensible measure of the agreement of the structural 
and mechanical predictions of two DFT solvers. The A- 
functional is defined as follows 

^'') = 1 ^ / dV {E\V) - E\V)) (30) 

n 

where Vl is an interval, E°-{y ) is the E vs V EOS for code 
a and E^{V) is the E vs V EOS for code b. The interval 
n is chosen as [0.94(Vo“ -h V^)/2, 1.06(Vo“ -h Kf)/2]. 

We have calculated the value of 
^^^RESCU^^wiEN 2 k^ foj. number of elements, 

where is the EOS calculated by RESCU and 

EWiEN 2 k jg EOS calculated by WIEN2lP. The 
WIEN2k EOS are provided in the supplemental content 
of Ref. [511 In RESCU, we used a grid spacing of 
0.14 Bohr and 6750/fV k-points for 7V-atoms unit cells. 




FIG. 10: Comparisons of the band structures of the GaAs 
and MgO compounds obtained by VASP and RESCU. 

The charge convergence criterion is 10“®e per valence 
electron and the energy convergence criterion is 10“® 
Hartree per valence electron. A Fermi-Dirac smearing 
with a temperature of 800 K is used. The computa¬ 
tion is performed within the GCA using the routines 
XC_GGA_X_PBE and XG_GGA_G_PBE from LibXC. 
Again, in the RESCU calculation the atomic cor es are 
modeled by Troullier-Martins pseudopotentialj ^^ l ^'^l 
The obtained A values are listed in Table M and the 
differences between the EOS are quite reasonable in all 
cases. These A values are comparable to those obtained 
with the electronic structure package Ablnit (using 
the Troullier-Martins pseudopotential) and WIEN2li^ 
Many codes include all elements from H to Rn ex¬ 
cept those between lanthanum and ytterbium in their 
test, but we leave such an exhaustive test to a future 
opportunity. 

We end this subsection by emphasizing that, while us¬ 
ing a more efficient solution process to solve the KS-DFT 
equation, there was no accuracy-degrading approxima¬ 
tion in RESCU and the accuracy tests presented here 
strongly demonstrate its quality. 


VIII. FURTHER DISCUSSIONS 

Having demonstrated the power of RESCU by solving 
the KS equation for thousands of atoms - both insulat¬ 
ing and metallic, both ordered and disordered, and on 
a modest computer cluster - we now discuss the princi- 
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Element A (meV/atom) 

Element A (meV/atom) 

H 

0.864 

Ca 

2.013 

Be 

7.080 

Cu 

7.758 

Mg 

1.392 

Rh 

18.627 

Al 

0.434 

Pd 

15.581 

Si 

3.608 

Ag 

11.286 

P 

7.291 

Cs 

1.082 

S 

6.830 




TABLE III: The A{E^’^^cu^ ^wiEN 2 k^ thirteen 

elements. 


pal bottlenecks of the real space method in RESCU. To 
this end, we use the results of the Si benchmark in Sec¬ 
tion |VII| for the discussion. We have plotted the time 
taken by the computationally intensive operations of Ta¬ 
ble |l] as a function of the number of atoms in Fig. IT 


they are the Chebyshev filtering, the Hamiltonian pro¬ 
jection onto the filtered subspace, the diagonalization 
of the projected pencil, the orthonormalization or/and 
computation of the Ritz vectors, and we add to this list 
the residual timing (ROC) which includes the remain¬ 
ing time. The timings for one self-consistent step per¬ 
formed by 256 cores are reported in Fig. 11 Both RR 
and pRR algorithms yield similar timings in all parts 
of the computation except for the orthonormalization. 
In solving the generalized eigenvalue problem, the eigen- 
solver computes the Cholesky factor of the overlap ma¬ 
trix to reduce the generalized eigenvalue problem to a 
standard form. In the partial Rayleigh-Ritz algorithm, 
this factor is reused to orthonormalize the filtered sub¬ 
space. In the standard Rayleigh-Ritz algorithm, there is 
no point in doing so since the eigenvectors for the gener¬ 
alized eigenvalue problem will directly provide the Ritz 
vectors which are orthonormal by construction. But the 
latter matrix is a general one whereas the former is tri¬ 
angular, and hence the factor of two speed up observed 
in Fig. 13 The diagonalization is faster in the partial 
Rayleigh-Ritz procedure, but it is not too significant as 
it is as large as the efficiency fluctuations observed in 
our computation tests. There are a few explanations to 
this. Firstly, according to the benchmark in Figj^ direct 
diagonalization is relatively cheap for matrices smaller 
than 10,000 x 10,000 which is about the size of the ma¬ 
trices in our largest system. Secondly, in a relatively 
simple system of bulk silicon, the projected matrix pencil 
H-AS is more easily diagonalizable than the random ma¬ 
trices used in the benchmark described above since it is 
closer to being diagonal. Moreover, it becomes closer and 
closer to being “diagonal” as the electronic density - and 
the Kohn-Sham invariant subspace - approaches its hxed 
point. The scaling of the residual timing appears lin¬ 
ear and that of the Chebyshev filtering is quadratic. The 
main bottlenecks are the projection and the orthonormal¬ 
ization which scale almost cubically. This originates from 
the complexity of matrix-matrix multiplication which is 
0{N^) or depending on matrix size and imple¬ 

mentation. 



FIG. 11: Time per self-consistent step as a function of the 
number of Si atoms using 256 processors. The full line link 
data points for the Rayleigh-Ritz algorithm and the dashed 
lines link data points for the partial version. 


The main bottleneck comes about because of the mas¬ 
sive amount of data required to encode the subspace 
In order to improve the method described in this work, it 
is crucial to achieve some sort of subspace compression. 
One way to address this issue is to calculate a localized 
basis for the filtered subspace as done by Motamarri et al. 
in Ref. HHl where the authors show that localizing the ba¬ 
sis is key to achieving a subquadratic computational scal¬ 
ing, in particular when evaluating the Hartree-Fock ex¬ 
change functional. Tangentially, subspace compression is 
crucially needed because the memory requirement scales 
as 0{N'^). Indeed, the RESCU method is so far more lim¬ 
ited by the memory requirement than the computational 
requirement as evidenced in the Si benchmark. Unfortu¬ 
nately, it appears that, for the reported tests, the sub¬ 
space “resists” localization as it approaches convergence 
and performing part of the computation with a dense 
subspace is still required. In a similar spirit, RESCU can 
use multiple-zeta numerical atomic orbital bases to solve 
the Kohn-Sham equations. If the memory requirements 
are not too severe, the NAO solutions are projected into 
real space and the energy may be further minimized using 
Chebyshev filtering. In summary, future research should 
focus on the following points: a vector space spanning 
the Kohn-Sham occupied subspace must be computed ef¬ 
ficiently, storing such a vector space should require 0{N) 
memory (the data must be compressible somehow) and 
the projected density matrix should be efficiently com¬ 
putable. In any event, even with the existing bottlenecks, 
RESCU has already achieved impressive computational 
efficiency in solving the KS-DFT problem. 
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IX. CONCLUSIONS 

In this work, we have presented a powerful Kohn-Sham 
DFT solver, RESCU. The goal of the RESCU method is 
to predict electronic structure properties of systems com¬ 
prising many thousands of atoms using moderate com¬ 
puter resources. The computational efficiency is gained 
by exploiting four routes. First, in real space, Cheby- 
shev filtering is used to expedite the computation of an 
invariant Kohn-Sham subspace in large systems. This ap¬ 
proach essentially exploits the fact that when the Hamil¬ 
tonian is not yet converged, one does not need to solve 
the KS equation extremely accurately. Second, we devel¬ 
oped a NAO-based method to efficiently generate a good 
initial subspace which is necessary in the Chebyshev fil¬ 
tering paradigm. Third, by judiciously analyzing various 
parts of the KS-DFT solution procedure, RESCU gains 
efficiency by delaying the 0{N^) scaling to large N; and 
our tests showed that RESCU scales as up to the 

several thousand atoms level. Eourth, RESCU gains effi¬ 
ciency by various numerical mathematics and, in partic¬ 
ular, we introduced the partial Rayleigh-Ritz algorithm 
and showed it leads to efficiency gains for systems com¬ 
prising more than 10,000 electrons. The RESCU code is 
implemented in MATLAB such that it provides a con¬ 
venient prototyping and development environment. It 
is also easily installed on many platforms and architec¬ 
tures. Einally, we mention in passing that we have also 
implemented total energy and force calculation methods 
into RESCU, but we reserve the discussion of structural 


relaxation using RESCU for the future. 

We demonstrated that the RESCU method could per¬ 
form large scale KS-DFT computations using computer 
resources ranging from 16 to 256 cores. At the 5,000- 
15,000 atoms level, there are many important material 
physics problems to be investigated and we wish to report 
them in the near future. From the method development 
point of view, to deal with even larger systems, we find 
it is essential to compress the Kohn-Sham subspace to 
achieve better computational effectiveness and to relieve 
computer memory requirements. Sparse matrix repre¬ 
sentations can alleviate the problem to some extent but 
do not solve it entirely. We think the solution may lie in 
the hierarchical matrix approximations which are data- 
sparse structures for dense matrices. We wish to present 
these efforts in the future. Finally, it is tempting and 
interesting to extrapolate the RESCU ability to much 
larger systems - using current supercomputers. However, 
we caution that a naive extrapolation may not work since 
for much larger N, the computational burden shifts to¬ 
ward the parts of the algorithm that scale as 0{N^). We 
believe these are important topics of future research. 
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